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Abstract 

Structure learning in random fields has attracted considerable atten- 
tion due to its difficulty and importance in areas such as remote sensing, 
computational biology, natural language processing, protein networks, and 
social network analysis. We consider the problem of estimating the proba- 
bilistic graph structure associated with a Gaussian Markov Random Field 
(GMRF), the Ising model and the Potts model, by extending previous 
work on h regularized neighborhood estimation to include the elastic net 
h + h penalty. Additionally, we show numerical evidence that the edge 
density plays a role in the graph recovery process. Finally, we introduce 
a novel method for augmenting neighborhood estimation by leveraging 
pair-wise neighborhood union estimates. 

1 Introduction 

Edge sparsity in an undirected graphical model (Markov Random Field) encodes 
conditional independence via graph separation. Essentially, graphical models 
detangle the global interconnections between the random variables of a joint 
distribution into localized neighborhoods. Any distribution P(X) consistent 
with the graphical model must abide by these simplifying constraints. Thus, 
the graph learning problem is equivalent to a model class selection problem. 
Let G = (V, E) be an undirected graph onp= |V(G)| vertices and m = \E(G)\ 
edges. Let X = (Xi, . . . ,X p ) e X v denote a random vector with distribution 
P(X), where variable Xi is associated to vertex i e V(G). Graphical model 
selection attempts to find the simplest graph, often dubbed the concentration 
graph, consistent with the underlying distribution. 

Recent work in graphical model selection exploits the local structure of the 
underlying distribution to derive consistent neighborhoods for each random vari- 
able. In terms of graphs, the neighborhood set of a vertex r is N(r) = {t e 
V(G) | (r, t) G E}. More importantly, for undirected graphical models, N(r) 
is the Markov blanket of r, where X r is rendered conditionally independent 
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of all other variables given N(r): X r _LL X\ ({ r }ujv(r)) I XnM- To estimate 
the neighborhood conditional probabilities P(X r \ X\ r ), these methods employ 



pseudo-likelihood measures, specifically l\ regularized regression: the lasso 11 



Compared to other l p penalty based regularization schemes, the l\ penalty en- 
joys the dual properties of convexity and sparseness by straddling the boundary 
between the two domains. By treating X r as the response variable and X\ r as 
the predictors in a generalized linear model, the l\ regularization penalty can 
recover an appropriately sparse representation of N(r). Reconstructing the full 
edge set of the graph using the estimated neighborhood N(r), allows for two 
alternate definitions: E A = {(a, b) : a e N(b) A b G N(a)} which we call AND; 
or E v = {(a, b) :ae N{b) V b € N(a)} which we call OR. 

Ravikumar et. al [8], J9] consider the problem of estimating the graph 
structure associated with a Gaussian Markov Random Field and the Ising 
model. Their main result shows that under certain assumptions, the prob- 
lem of neighborhood selection can be accurately estimated with a sample size 
of n = Q(d 3 logp) for high dimensional regimes where d is the max degree, and 
(p >> n). Note that the number of samples needed is further improved in [8] to 
Fl(d 2 logp) for GMRF model selection. Meinshausen et. al [7] also examine the 
GMRF case, and provide an asymptotic analysis of consistency under relatively 
mild conditions along with an alternate Ai penalty. 

We build upon this previous work by extending the l\ penalized neighbor- 
hood estimation framework to use the elastic net 13 l\ +I2 penalty and expand- 
ing the scope of graphical model recovery to include the multinomial discrete 
case. While the lasso performs beautifully in many settings, it has its draw- 
backs. In particular, when p > n, the lasso can only select at most n variables. 
Moreover, for highly correlated covariates, the lasso tends to select a single vari- 
able to represent the entire group. By incorporating the I2 penalty term, the 
elastic net is able to retain the lasso's sparsity while selecting highly correlated 
variables together. 

Additionally, we introduce a novel scheme for augmenting neighborhood 
recovery by pooling pair-wise neighborhood union estimates. The idea is to infer 
the joint neighborhood of a pair of nodes (not necessarily adjacent), and 

obtain the neighborhood of node i by combining all the information given by the 
n — 1 pairs of nodes containing node i. The frequency with which nodes appear 
in a specially designed neighbor list for node i gives us a weighted ranking of 
nodes in terms of their neighbor likelihood. This method can be combined with 
the usual neighborhood recovery to extract more information from a possibly 
insufficient set of samples. 



2 Problem formulation 



Undirected graphical models encode the factorization of potential functions over 
cliques, which in their most basic form, are comprised of 1st and 2nd order 
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interactions: functions that map node and edge values to the real line: 



P[X ^ = \ II <t>sMs,Xt)\l<i>u(x u ), 

(s,t)££ uEV 

which for max-entropy exponential family distributions, can be written as 



\(s,t)eE u€V 



Note that Z represents the normalization constant or partition function. 

For continuous random variables, the most common exponential family MRF 
representation is the multivariate Gaussian with sufficient statistics {X s ,X^\s e 

V}u{x s x t \( Sl t)eE}. 



P(X) = | exp ( e rX r + ^ E E ®stX s X t ) 
\rev sevtev J 



(1) 



The px p symmetric pairwise parameter matrix 0, known as the inverse covari- 
ance matrix of X denotes the partial correlations between pair of nodes, given the 
remaining nodes. Every edge (s, t) G E will have a non-zero entry in and each 
row s of specifies the graph neighborhood N(s). Conversely, the sparsity of 
reveals the conditional independencies of the graph where <d st — 0, V(s, t) ^ E. 
Conditional neighborhood expectations can be represented by a linear model: 
E(X S \X\ S ) = Y,teN(s) QstXf 

In the binary case, the MRF distribution can be described using an Ising 
model where X s e {-1,1}, Vs e V, and <j> st {X s ,X t ) = st X s X t . The full 
probability distribution takes the following form, which omits first order terms: 

P(X) = |exp Y, 6 stX s X t \ (2) 
\(s,t)eE ) 

The conditional neighborhood probability P(X S \X\ S ) is defined as: 

exp(2X s X)( s ,t)eB OstXt* 
<(s,t)eE 



cxp{2 x s j: (st)eE e at x t ) + i {6) 



Taking the Hessian of the local conditional probability gives the Fisher informa- 
tion matrix for X s , Much like partial correlations in the Gaussian concentration 
matrix, zero entries in the Fisher information matrix indicate conditional inde- 
pendence. 

Extending the discrete parameterization to variables with k > 2 states, re- 
quires an expansion in terms where the edge potential functions <j) now describe 
a set of parameterized indicator variables I(X S = l,X t = m) representing the 
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k 2 possible value pairs between X s and X t . 



/ k k k 

P(X) = - exp [J2 Yl °s-MX s =%)+ EE 6 st-.i m l{X s = l,X t = m) 



Z 

\s£V 1=1 (s,t)£E 1=1 m=l 



As described in 12 , this particular representation is over complete since the 
indicator functions satisfy a variety of linear relationships y\._-, I S (X S = i) = 1. 
However, despite the lack of a guaranteed unique solution, the factorization can 
still satisfy the desired neighborhood recovery criterion. A simplified variant 
of the general discrete parameterization is the Potts model where each <fi is 
defined by two indicator functions denoting node agreement and disagreement 
for arbitrary k > 2. We observe that in the Ising model, the form of <f> may be 
be recast as (j> st {X s ,X t ) = 6 st I(X a = X t ) ~ 9 st l{X s £ X t ). 

P(X) = | exp (^^0 s: J(A s =*)+ J2 O st l{X s =X t )-e st l{X s ^X t ) 

\seV i=l (s,t)£E / 

Note that the Potts model only requires a single parameter and generalizes the 
Ising model to k states. 

To extend neighborhood estimation from the binary Ising model case to a 
discrete parameterization, we note that the neighborhood conditional probabil- 
ity takes the form 

P(X -d\X ) - exp "^ s:d + ^(s,*)g£ ^"=1 d sf-dmKX s = d,X t = m)} 

V Ef=i exp{6» s:i + J2( s t )eE ELi O st :i m I(X s = I, X t = m)} 

(4) 

which is equivalent, after a variable transformation from a discrete feature space 
to indicators, to the classical multinomial logistic regression equation: 

p(x s = d i x v ) = *?M^mA (5) 

where Ajv( s ),; = {^{X s = l,X t = m) | (s, t) e E 1 } with an additional singleton 
indicator variable I(A S = I) always set to 1. With the conditional probability 
equations in hand, we can approach the problem of neighborhood estimation as 
a generalized linear regression. Building on previous model selection work using 
the l\ penalty, we extend the approach, to use the combined l± + I2 penalty 
approach of the elastic net [13] , which for the basic linear model takes the form: 

L(\ 1 ,\ 2 ,e) = \\x s -X\ s e\\l + *iW\\i + H\o\\l 

The elastic net performance surpasses the l\ penalty under noisy conditions and 
where groups of highly correlated variables exist in the graph. However, as noted 
by Bunea [2] , the additional 1% smoothing penalty should be small relative to the 
l\ term to preserve sparsity. Many authors have extended the elastic net penalty 
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to additional regression models, covering a broad swath of the generalized linear 
realm. For the linear Gaussian case, we use the original elastic net package of 



Zhou and Hastie 13 . For binary and multinomial regression we rely on the 
glmnet library of Friedman et. al [4j. 

3 Experimental Evaluation 

To evaluate the elastic net for Gaussian MRF model selection, we generate the 
distribution inverse covariance matrix O = in the following way. We set 
Qij = 0.5 whenever (ij) 6 E, and then perturb the diagonal of the matrix 
<du = r, with t large enough to force all eigenvalues of 6 to pe positive. We 
experimentally choose t, starting from 1 and increasing it in increments of 0.1 
until we get a value that makes positive definite. 

In the case of the binary and discrete models, we require a more complicated 
procedure based on MCMC sampling. However, given the size of our graphs, the 
direct Gibbs sampling approach proved to be computationally expensive because 
of its long mixing times and slow mode exploration when the temperature (the 
0's in our case) is low. 

To overcome this difficulty, we turn to the Swendsen-Wang algorithm. This 
method generates an augmented graph G — (V, E), where V — V U E and E 
contains (v, e) iff v € V and e G E are incident. Given this formulation, G is 
bipartite between the V nodes and E nodes. Thus in the joint distribution of 
G, the Markov blanket of E will only consist of elements in V and vice versa. 
The random variables assigned to E can only take the values and 1. 

We define the conditional probabilities of V and E as: 

• P(j(e = 1\V) is given by considering the nodes s,t 6 G s.t. e is incident 
with s and t. P(j(e = 1\V) = 1 — 2e~ 2J if x s = xt and otherwise. 

• Pq(V\E) is such that all nodes in the same component (in the graph when 
we consider only the edges e s.t. e = 1) have the same value, and each 
component takes each of the k possible values with equal probability. 

Essentially, the algorithm generates MCMC samples by alternately updating 
the values of V and E using Gibbs sampling. Although the augmentation sub- 
stantially increases the number of vertices, the algorithm creates a Markov chain 
that explores the space of outcomes much more rapidly. For the details of the 
Swendsen-Wang algorithm, we refer the reader to 10 and [6]. 



By introducing an li penalty term to the regression model, the maximization 
problem in the elastic net setup becomes 

qsMM = arg min \\X. - X0\\j + X^Wi + ^PWl (6) 

o:u B — U 

with A2 = c \/(^p)- We experimented with several values of A2 for different 
number of samples, and observed the type I and type II probability errors. 

In Figure [l] we show 3D plots of the total error rates as a function of the 
number of samples and the A2 parameter, for the AND and OR neighborhood 
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estimation. Several observations can be made from these plots. First note that 
larger values of A2 performs worse than smaller values, no matter what the 
sample size is. The best recovery rates are achieved when A2 is very small. 
Also note that the AND neighborhood selection perform much worse than its 
OR counterpart when A2 is large. The figure [2] plots the error rates versus the 




Figure 1: Total error rates as a function of A2 and number of samples n, in the 
AND neighborhood selection (left) and the OR neighborhood selection (right). 

sample size, for a fixed A2 = cy^(^p). Note that the chosen graph has 40 
vertices and is of maximum degree d = 25, and it cannot be recovered without 
errors even when the number of samples scales as 5cPlog(p). 




500 1000 1500 2000 2500 3000 3500 4000 4500 

n 

Figure 2: Error rates for a random graph, with p = 40, A2 = \J l ° 9 ^ , and 
n < 5d 2 log(p). 

The first family of graphs we tested were the star graphs and the more general 
clique of star graphs. We denote by Star(a,b) the clique-star graph obtained 
from a copies of a star graprj^] by connecting all star centers among them, or 
in other words we add d different neighbors to each vertex in a clique of size a. 
Star(l,b) is just the standard star graph. Note that the maximum degree in 
Star (a, b) is d = a + b — 1 and the total number of vertices is p = a(b + 1). 

A star graph consists of one node of degree q connected to the remaining q vertices of 
degree 1 
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For a graph G we let p{G) denote the edge density of the graph, i.e. p = 
n( 2 n-i) ' ^ ne reason we introduce this parameter in our simulations is to observe 
the impact of edge density on the recovery rates when the maximum degree 
and the number of samples are fixed. As our simulations show, recovering the 
graph structure is significantly harder when the graph has a higher density but 
the same fixed maximum degree d. To test this, we generate a star graph with 
maximum degree d and edge density p\ , and then start adding edges among the 
lower degree neighbors to obtain a new graph with new edge density p2 > pi- 
Note that this edge density dependence can be cquivalently formulated in terms 
of the average degree d of a graph by the formula d = (p — l)p. The error 
rates are averaged over 20 runs for a fixed graph G on different samples of size 

n. Additionally, A2 was chosen by discretizing the interval [0, y^p] into 15 
equally sized subintervals. 

Figure |3] plots the error recovery rates for G\ — Star (1,24) with p\ = 0.16 
(top) and for the graph obtained by adding edges to G2 = Star (I, 24) until 
p2 = 0.76(bottom), with both graphs having the same maximum degree d = 24. 
Note that for these graphs d 2 log(p) w 1800 and as seen in the top plot of Figure 
[3j a bit over 1000 samples are enough to bring the error rates to zero. However, 
the bottom shows that even with 10 times more samples, we can only recover 
G2 with a 0.30 error rate. We repeat the above experiment for the clique-star 
graph Hi = Star(6,A) with p — 30 and edge density p\ = 0.18, and the graph 
H2 obtained by adding edges to Hi while keeping the maximum degree d = 9 
unchanged. In this case, d 2 log(p) — 275 and we successfully recover Hi only 
when the sample size exceeds 1400, due to higher edge density (plot omitted). 
Figure [4] shows the error rates when we increase the edge density to pi = 0.3, 
which emphasizes the increase in sample size required for graph recovery. Note 
that in both simulations the A2 penalty was rarely of any help, and in most 
cases A2 = achieved the best error rates. 

A second type of graph we considered was the community graph, denoted 
by Com(s,t, /3 in , /3 out ), which consists of s groups of highly connected nodes, 
where each group has size t, so p = st. Two vertices within the same group 
(or community) are connected with probability /3j n , while nodes that belong to 
two different communities share an edge with a smaller probability /3 out . These 
community structures are a common feature of complex networks, and have the 
property that nodes within a group are much more connected to each other than 
to the rest of the network (for (3 in > /3 OM t). In application, these communities 
may represent groups of related individuals in social networks, topically related 
web pages or biochemical pathways, and thus their identification is of central 
importance. To completely understand the modular structure of such graphs, 
one should be able to both detect overlapping communities and make meaningful 
statements about their hierarchies [5] . Figure [5] plots the error rates in the 
recovery of a Com(A, 8, 0.8, 0.15) graph with p — 32, d = 15, and p = 0.28. 
d 2 log(p) — 780, however again even with 9000 samples, the error rate is still 
over 10 percent. When few samples are available, A2 = achieves the best 
error rates, but as we increase the number of samples we notice that the elastic 
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Figure 3: Error recovery rates (y-axis) versus the A2 parameter (a;-axis) for the 
graphs Gi = Star (1,24) with pi = 0.16 (top) and G 2 with p 2 = 0.76 (bottom). 




Figure 4: Error recovery rates (y-axis) versus the A2 parameter (x-axis) for the 
graph H2 with p\ = 0.30. 



net method with A2 > performs slightly better than when A2 = 0. While this 
improvement is not significant, it hints that the additional l 2 regularization may 
produce better results in some cases. 



4 Discrete MRF recovery 

Figures [6] and [7] depict the performance of the elastic net neighborhood esti- 
mator over a range of discrete MRF graphs. The graphs evaluated for the 
Ising and Potts model are random graphs with bounded maximum degree. All 
experiments were run over a sample range covering the d 3 \ogp edge recovery 
threshold and a values ranging from 0.5 to 1, where a = t^+t^- Results from 
multiple trials were averaged for the Ising model. Due to the computational 
load of the glmnet multinomial regression for large data sizes, only single run 
results are shown for the Potts model. Smaller a values are omitted from the 
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n = 780 n = 3120 n = 9360 
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Figure 5: Error rates for the community graph Com(4, 8, 0.8, 0.15) with p = 32, 
d = 15, and p = 0.28 



plot in order to limit the scale and improve clarity. Unless otherwise noted, 
the plots represent AND neighborhood unions, with OR neighborhood unions 
showing similar performance. 

As seen in the plots, the elastic net neighborhood estimator recovers the 
underlying graph with high probability under corresponding fl(d 3 logp) sample 
sizes, validating our formulation of the discrete model neighborhood estimation 
as a multinomial logistic regression. Similar to the Gaussian case, the effect of 

the l 2 penalty A 2 , (while Ai is set to J l ° s ^ ) tends to benefit neighborhood 
recovery mostly at small sample values and when a is close to 1. Oversized li 
penalties introduce an inordinate number of noise edges, but small l 2 penalties 
reduce the chance of missing edges with weak correlation which the l\ penalty 
rejects. When the I2 penalty is non-zero, the minimization function is strictly 
convex and allows the estimator to select additional nodes that exhibit highly 
correlated behavior by effectively averaging their contribution. 

Ising model p = 64, d = 10, Type I error rate Ising model p = 64, d = 10, Type II error rate 




Figure 6: Type I and II error rate for an Ising model graph with 64 nodes 
and max degree 10. While a ~ 1 minimizes FPR across sample sizes, the 
conservative nature of the l\ penalty induces false negatives even for large sample 
sizes. Choosing an a < 1 increases the likelihood of recovering additional edges 

The A2 parameter provides, in essence, a trade-off between precision and 
recall, as it can be seen in Figure [HJ especially when the number of samples is 
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Potts model p = 64, d = 1 0, k = 4, Type I error rate Potts model p = 64, d = 1 0, k = 4, Type II error rate 




Figure 7: Type I and II error rate for a Potts model graph with 64 nodes, max 
degree 10, and 4 states. When a = 1 the FPR is minimized over all sample 
sizes. However, again, the l\ penalty induces false negatives for small sample 
sizes 



small, which is the case in many high dimensional p»n applications. While 
the a = 1 curve consistently provides the highest graph recovery precision over 
all sample sizes, the actual number of recovered edges may be extremely limited 
due to the sparsity constraint. For the Ising model graph depicted in the figure, 
the smallest sample size 1200 with a — 1 gives a precision of 0.88 with a recall of 
0.7. By introducing a A2 term with a = 0.8, the precision drops to 0.8 but the 
recall improves to 0.79, which is better than a 1 to 1 trade-off. As expected, with 
large sample sizes, the benefit of the A2 parameter diminishes as shown by the 
nearly vertical slope of the large sample size curves. Similarly, the Potts model 
recall-precision plot also displays this trend, albeit in a compressed fashion since 
the neighborhood estimator is able to recover the graph at a smaller sample size, 
rendering the larger sample size curves uninformative. From these results we can 
say that the additional presence of an I2 penalty may yield substantial benefits 
for p n situations where the goal is to extract relevant correlation information 
from small sample sizes. 



5 Neighbors of pair of vertices 

The technique introduced in this section is useful in neighborhood reconstruc- 
tion especially in the case of regular graphs, graphs with a small gap between 
known maximum and minimum degree, or when we would like to obtain a like- 
lihood ranking of the p — 1 possible neighbors of a fixed node i. The idea is to 
infer neighborhoods not only for one vertex at a time, but for pairs of vertices, 
which may or may not be adjacent. We denote by A/' 1 (i) the estimation of the 
neighborhood of node i, as given by the optimization in equation ([6]). 

We denote by Afij the set of neighbors of nodes i and j, i.e. Afij = {v 6 
V(G)\{i,j} : (i, v) £ E or (J, v) € E}, in other words Afij is the union of the 
neighborhoods of nodes i and j, minus the edge if it exists. We now define 
the following optimization problem similar to the one in equation |6]) 
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Recall vs. Precision (Islng, p = 64) 



Recall vs. Precision (Potts, p = 64) 




Figure 8: Recall-Precision curve for Ising and Potts model with 64 nodes and 
max degree 10. Horizontal curves correspond to a values varying over sample 
sizes represented by the vertical dashed black curves increasing to the right. The 
left most vertical curve corresponds to sample size 1200. Although the a = 1 
pure l\ penalty dominates precision performance, choosing a smaller a permits 
a trade-off in precision for recall that is particularly beneficial for small sample 
sizes 



0i jj)\ u \ 3 := arg min \\Xi-xe 1 +x j -xe J \\l + x 1 \\d I \\i 

e 1 ,e J ':0f=o,0^ =o 

+X2\\0 I \\l + X 1 \\e J \\ 1 + X 2 \\9 J \\l (7) 

After grouping the terms of Ai and A2 and approximating ||0^|]i + H^lli by 
WO 1 + J ||i, and H^Hi + ||6> J ||^ by \\9' + J ||§, we approximate ^ by 

arg min || {X t + A,) - X9\\ 2 2 + X x \\9\\ l + X 2 \\0\\ 2 2 =: 9^ JMM (8) 

where in the last step we make the change of variable 9 = 6 1 + 9 J and we 
denote by 9 I,J ' Xl ' X2 the regression coefficients of the sum of variables / and J 
against the remaining variable. We now define the estimated neighborhood of 
a pair of vertices (i, j), not necessarily adjacent, to be AAy ={u6 V(G)\{i, j} : 
JJ,Al ' A2 (u) 7^ 0}, in other words Nij is an estimate of ftfij. Note that for a 
vertex v G Afij it may be the case that v is adjacent to either i or j or perhaps 
both. 

For a fixed node i, we obtain its neighborhood in the following way. We let 
Ci denote the list obtained by concatenating the pair neighborhoods of node i 

A = |jA^ 

where |J denotes union with repetitions. We denote by Ci the concatenation of 
the estimated pair neighborhoods, i.e. Ci — Uj^j-^i- Note that Ci includes 
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all nodes that are neighbors of i, each appearing with multiplicity p — 2. If j is 
a neighbor of i, then j £ Nik for all k ^ j, j, i.e. p — 2 times. In the absence 
of errors A/y = A/ij, A = A; an d with the exception described in the next 
paragraph, we can correctly recover the neighborhood of node i by picking the 
most frequent elements from Ci, i.e. all nodes which appear in the list exactly 
p — 2 times. In the case of errors, we obtain an estimate for the neighborhood 
of node i by selecting the most frequent elements in Also, if there are no 
errors, d contains all other nodes j ^ i of G at least once. This is obvious if 
j ~ i, as j appears p — 2 times in Ci as explained above. If j ^ i, then pick k 
a neighbor of j (k exists since we assumed G is connected) and it must be that 
j € N ik C Ci. 

Note that a non-neighbor node j of i can appear n — 2 times in £^ if j is 
connected to all nodes in V(G)\i, in which case we (incorrectly) add j to the 
neighborhood of node i. Similarly, if i is connected to all nodes in V(G)\j, then 
i appears in Cj with multiplicity p — 2 and we (incorrectly) mark i as being 
in the neighborhood of node j. In other words, i and j appear in each other's 
neighborhood lists, thus rendering our approach incorrect, whenever i and j are 
both connected to all other p — 2 vertices in the graph. However, for random 
graphs this scenario occurs with a very low probability. 




Figure 9: Af 2 = {1,6,7,8,11,17,25,29,30}, Af 2 

{1,6,7,8,11,15,17,25,29,30}, C 2 = { 30(33), 11(33), 25(32), 17(32), 
7(31), 8(30), 29(29), 1(27), 6(26), 15(22), 35(18), 32(17), . . .} where bold- 
face indicates true neighbors and numbers in parentheses denote frequencies in 
C 2 

As shown in Figure [9j true neighbors of node i occur most frequently in 
Ci. When ordering vertices in Ci based on their frequency, most of the true 
neighbors appear at the top of the list. However, errors occur and false neighbors 
sometimes precede true neighbors. Note however that there are cases when 
the single neighborhood estimation performs much worse and omits many true 
neighbors. 

We have seen how histograms based on neighborhoods of pairs are useful 
in determining a likelihood ranking of possible neighbors of a given node. The 
top ti most frequent elements in Ci are the most likely neighbors of i. The 
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problem now becomes how to select this threshold value U for each list. If U is 
too small then true neighbors might be left out, and if t% is too big then we will 
introduce false neighbors. We make an additional observation that improves on 
the accuracy of the above ordering obtained from Ci. Denote by L the matrix 
formed from lists Ci by letting equal the frequency of node j in list £j . To 
incorporate the symmetry between two neighboring nodes: if i is a neighbor of 
j then j is also a neighbor of i, we build the symmetric matrix S = L + L T . 
The intuition here is to average out the votes received by nodes i and j in their 
respective rows. Suppose that € E, but j does not rank highly in Ci. 

However, it may be the case that i ranks highly in Cj, and helps in identifying 
that (i, j) are indeed neighbors in G. One can think of this method as averaging 
out the bad information (noisy edges) and boosting up the good information 
(correct edges). 

Finally, another alternative would be to first row normalize L and then 
construct the symmetric matrix described above. We divide each row in L by 
the largest entry in that row and obtain the row stochastic matrix L, whose row 
i may be thought of as ranked probabilities of the possible neighbors of i. We 
then build the row stochastic matrix S = \(L + Z T ), as described above. 

As mentioned earlier, the main problem is finding the threshold U for each 
row i to separate the neighbors for non- neighbors of i. If we know a priori what 
the degree of each node is, then one way to pick the neighbors would be to select 
the most frequent di entries in Alternatively, if we know that the graph is 
almost regular of degree r, or in other words that the average degree of the 
graph G is r but the degree distribution has very little variance around r, then 
we can again select the top r most frequent entries in Ci. Another way one can 
choose a threshold is to plot the frequency values in order and look for a big 
jump in the graph. This idea is illustrated in the right plot of Figure [9] Note 
how the frequency values decreases suddenly within two steps from 26 (for node 
6) to 22 (for node 15) and then to 18 (for node 35). Such large sudden drops in 
the ordered list of frequency values hint at a good threshold point. 

Table [T] shows the results of an experiment that illustrates the above ideas 
when we have oracle information about the degree of each node. We observe that 
the symmetric matrix S works better than just using L, and that S performs 
better than S. Note also that the type I and II errors are now more balanced 
both in the AND and OR case. When doing the estimation TV 1 , in the AND 
case almost all the errors were coming from missing edges, while in the OR 
scenario almost all errors were given by false edges. Picking the threshold t t 
allows for a trade-off between these two type of errors. 

It would be interesting to compare the results of our new pair neighborhood 
recovery to the original single neighborhood method, when both techniques 
take into account the knowledge of node degree. One way to incorporate this 
information into the single neighborhood method is to avoid using 10-fold cross 
validation and replace it with the following procedure. We have seen that when 
regressing variable i against all other variables, the elastic net method does not 
return a single Oi vector corresponding to the optimal Ai value, but rather com- 
putes an entire matrix (or sequence) where each row corresponds to a value of 



13 



Method 


AND (Error type) 


OR (Error type) 


I 


II 


Total 


I 


II 


Total 


N 1 


0.04 


0.58 


0.63 


0.12 


0.36 


0.48 


N*,L 


0.04 


0.27 


0.31 


0.32 


0.09 


0.41 


N 2 ,S 


0.07 


0.17 


0.24 


0.15 


0.05 


0.20 


Af 2 ,S 


0.05 


0.14 


0.19 


0.135 


0.045 


0.18 



Table 1: A plot with the type I, type II, and total error probability for different 
ways of estimating the neighborhoods: Af 1 is obtained by the original method, 
and the remaining three rely on histograms of neighborhoods of pairs when then 
node degrees are given. 

Ai at which an additional variable "turns on" . Instead of picking the empirically 
optimal row with the 10-fold cross validation method as our 9i vector, we can 
simply pick the first row which has r» nonzero entries, where r» is the degree of 
node i. One can also interpret the order in which the remaining p — 1 variables 
"turn on" as a way of ranking the potential neighbors. A variable which "turns 
on" sooner on the elastic net path is more likely to be a neighbor of i than a 
node which activates later on. This ranking can also be obtained by our pair- 
wise neighborhood union estimate, however we produce more than just a simple 
ranking. The frequency of each node in Ci combines additional information, 
and one can interpret this ordering as a weighted ranking of possible neighbors 
of i. This additional information may capture longer range correlations that 
elude single neighborhood estimation, as suggested in a recent paper of Bcnto 
and Montanari [I] . 

6 Conclusion 

In this paper we considered the problem of estimating the graph structure asso- 
ciated with a GMRF, the Ising model and the Potts model. Building on previous 
work using l\ penalized neighborhood estimation, we experimented with the an 
additional li penalty term. Simulations across the three models show that a 
small but non-negligible A2 penalty term improves the edge recovery rates when 
the sample size is small by trading precision for recall. We make the observa- 
tion that in the GMRF model, the addition of the I2 penalty term does not 
have much influence on the recovery rates. Numerical simulations confirm our 
hypothesis that the lower bounds on the number of samples needed for recovery 
should not only be a function of the maximum degree d and number of nodes p, 
but also of the edge density p (or equivalently the average degree of the graph) . 
We also introduce a new method for improving the neighborhood recovery by 
considering pair-wise neighborhood unions which produce a ranking of p — 1 
nodes in G with respect to their likelihood of being adjacent to the remaining 
node. This can be thought of as a way to incorporate local information (rank- 



14 



ings) at each node into a globally consistent edge structure estimation of the 
graph G. 
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